NSF PAR Search | NSF Public Access Repository

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Halawi, Daniel; Wei, Alex; Wallace, Eric; Wang, Tony Tong; Haghtalab, Nika; Steinhardt, Jacob (July 2024, Proceedings of the 41st International Conference on Machine Learning, PMLR)

Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.
Full Text Available
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

Min, Sewon; Gururangan, Suchin; Wallace, Eric; Shi, Weijia; Hajishirzi, Hannaneh; Smith, Noah; Zettlemoyer, Luke (May 2024, ICLR)

Full Text Available
Concealed Data Poisoning Attacks on NLP Models

https://doi.org/10.18653/v1/2021.naacl-main.13

Wallace, Eric; Zhao, Tony; Feng, Shi; Singh, Sameer (January 2021, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies)
Full Text Available
Gradient-based Analysis of NLP Models is Manipulable

https://doi.org/10.18653/v1/2020.findings-emnlp.24

Wang, Junlin; Tuyls, Jens; Wallace, Eric; Singh, Sameer (January 2020, Findings of the Association for Computational Linguistics: EMNLP 2020)

Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, the fact that they directly reflect the model internals. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a Facade Model that overwhelms the gradients without affecting the predictions. This Facade Model can be trained to have gradients that are misleading and irrelevant to the task, such as focusing only on the stop words in the input. On a variety of NLP tasks (sentiment analysis, NLI, and QA), we show that the merged model effectively fools different analysis tools: saliency maps differ significantly from the original model’s, input reduction keeps more irrelevant input tokens, and adversarial perturbations identify unimportant tokens as being highly important.
Full Text Available
AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models

https://doi.org/10.18653/v1/D19-3002

Wallace, Eric; Tuyls, Jens; Wang, Junlin; Subramanian, Sanjay; Gardner, Matt; Singh, Sameer (October 2019, Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations)

Neural NLP models are increasingly accurate but are imperfect and opaque—they break in counterintuitive ways and leave end users puzzled at their behavior. Model interpretation methods ameliorate this opacity by providing explanations for specific model predictions. Unfortunately, existing interpretation codebases make it difficult to apply these methods to new models and tasks, which hinders adoption for practitioners and burdens interpretability researchers. We introduce AllenNLP Interpret, a flexible framework for interpreting NLP models. The toolkit provides interpretation primitives (e.g., input gradients) for any AllenNLP model and task, a suite of built-in interpretation methods, and a library of front-end visualization components. We demonstrate the toolkit’s flexibility and utility by implementing live demos for five interpretation methods (e.g., saliency maps and adversarial attacks) on a variety of models and tasks (e.g., masked language modeling using BERT and reading comprehension using BiDAF). These demos, alongside our code and tutorials, are available at https://allennlp.org/interpret.
Full Text Available
Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Li, Zhuohan; Wallace, Eric; Shen, Sheng; Lin, Kevin; Keutzer, Kurt; Klein, Dan; Gonzalez, Joseph E. (January 2020, Proceedings of the International Conference on Machine Learning (ICML))
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
Full Text Available
Misleading Failures of Partial-input Baselines

https://doi.org/10.18653/v1/P19-1554

Feng, Shi; Wallace, Eric; Boyd-Graber, Jordan (January 2019, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics)

Full Text Available
Trick Me If You Can: Human-in-the-Loop Generation of Adversarial Examples for Question Answering

https://doi.org/10.1162/tacl_a_00279

Wallace, Eric; Rodriguez, Pedro; Feng, Shi; Yamada, Ikuya; Boyd-Graber, Jordan (March 2019, Transactions of the Association for Computational Linguistics)

Adversarial evaluation stress-tests a model’s understanding of natural language. Because past approaches expose superficial patterns, the resulting adversarial examples are limited in complexity and diversity. We propose human-in-the-loop adversarial generation, where human authors are guided to break models. We aid the authors with interpretations of model predictions through an interactive user interface. We apply this generation framework to a question answering task called Quizbowl, where trivia enthusiasts craft adversarial questions. The resulting questions are validated via live human–computer matches: Although the questions appear ordinary to humans, they systematically stump neural and information retrieval models. The adversarial questions cover diverse phenomena from multi-hop reasoning to entity type distractors, exposing open challenges in robust question answering.
Full Text Available
Universal Adversarial Triggers for Attacking and Analyzing NLP

https://doi.org/10.18653/v1/D19-1221

Wallace, Eric; Feng, Shi; Kandpal, Nikhil; Gardner, Matt; Singh, Sameer (January 2019, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP))

Full Text Available
